A Study of the Data Provided by the Donors Choose Organization

by Steve Henle Hazel John Jim Schlough

Overview and Motivation:

Funding the contemporary K-12 classroom is greatly challenging and many teachers believe that resources provided are insufficient in meeting the most basic objectives. Nontraditional support is playing an increasingly important part in supporting the modern classroom.

The Donors Choose web platform provides a mechanism of providing support to teachers by benefactors. A potential donor may read an appeal written by a teacher to fulfill a specific classroom material need, and donates towards their funding goal. If the funding goal is met, these materials are sourced by fulfillment sources under the control of the Donors Choose organization, and sent directly to the school.

Given the changes in public sentiments and commitments towards financial support of the community school over the past few decades, the ability to raise funds directly into the classroom might come to be considered a vital skill of the teacher in supporting classroom activities.

When written appeals succeed by becoming fully funded, or fail by expiring, the data surrounding the appeal is gathered and made publicly available. By analyzing this data, it might be possible to better understand the factors correlated with success or failure of a written appeal. Some factors such as location, date and time, and poverty level are beyond the control of a teacher. Other factors, such as the written content of an appeal, or to a lesser extent, the credentials of the teacher, can be controlled.

We wish to apply statistical analysis of the available data in an effort to go beyond axiomatic and aesthetic beliefs regarding what makes a more or less effective funding proposal.

back to top

Initial Questions:

What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis? What makes the difference between a proposal that is funded and one that expires? Which are the winning qualities? Of the predicting qualities, are there any that are under the control of the writer? Does the content of the written essay matter?

back to top

Data:

We are using the data made publicly available on the Donors Choose website. This publicly available comes in the form of downloadable csv files, ranging in size from megabytes to gigabytes. This data can be found at here.

This is a suggested schema for the recomposition of their open dataset:

back to top

Data Wrangling

back to top

Step 1: Data Download & Cleanup

Our first task was to download the data, clean it up and extract the data we needed. We decided to do the analysis with just the data from 2014, so the final task was the filter out unneeded data.

back to top

rm_old: Our First Data Wrangling Utility

Our first plan was to use a combination of AWK and SED to do the necessary data cleanup, but time & date fields proved to be problematic. After an evening of steady efforts along those lines, a C++ data cleaning application was written as a stop gap measure. In this way we were able to separate the 2014 projects records from the csv file and get the rest of the team started with the data. This is the source code for the intermediate c++ application, named rm_old as it removed the records older than 2014 and also 2015 or newer.

back to top

c++ Source code for rm_old

//
//  main.cpp
//  rm_old
//
//  Created by Jim Schlough on 4/22/16.
//  Copyright © 2016 Jim Schlough. All rights reserved.
//

#include <iostream>
#include <fstream>    // for ifstream, ofstream

#include <string>
#include <ctime>
#include <cstdlib>
#include <stdio.h>    // for tmpnam, remove

// for time & date processing:
#include <sstream>
#include <locale>
#include <iomanip>

using namespace std;

int main(int argc, const char * argv[]) {

    // insert code here...
    if (argc< 3 )
    {
        std::cout << "Usage rm_old fileInName bottomCutOffDate topCutOffDate dateFieldIndex" << endl;
        std::cout << endl;
        std::cout << "   dateFieldIndex is ONE based" << endl;
    }
    
    char filebuf [L_tmpnam];
    ::strcpy(filebuf, argv[1]);
    
    std::string outFileName;

    int dateFieldIdx = 0;
    dateFieldIdx = std::atoi(argv[4])-2;
    
    
    // TODO: check for clean cutOffDateInput here
    int64_t bottomCutOffDateValue = 0L, topCutOffDateValue = 0L;
    bottomCutOffDateValue = std::atol(argv[2]);
    topCutOffDateValue = std::atol(argv[3]);
    
    // TODO: check for valid (positive integer) date field index (1 based) here
    

    std::ifstream inputFile (filebuf, std::ios::in);
    outFileName.append(filebuf);
    outFileName.erase( outFileName.find(".csv"),4);
    outFileName.append("_output.csv");
    
    std::ofstream outputFile (outFileName, std::ios::out);
    std::string line, submittedDateTimeStr, submittedDateStr;
    bool skipFirst = true;
    
    if (inputFile.is_open())
    {
        std::getline(inputFile, line);
        skipFirst = (line.find('\"') == std::string::npos); // first line is  header
        
        
        while( inputFile)
        {
            if (skipFirst)
            {
                skipFirst = false;
                outputFile << line << endl;
            }
            else std::getline(inputFile, line);
            
            if (line.length() < 2) continue;
            size_t numberCommas = std::count(line.begin(), line.end(), ',');
            if (numberCommas < 43 ||
                line.find("\"") == std::string::npos ) // skip the header line, which has no "
                continue;
            
            // find the position of the date in the 41st field
            int x = 0;
            //std::string::size_type
            int lastPos=0, startOfDatePos = 0, endOfDatePos = 0;
            int64_t dateIntValue = 0L;
            
            // TODO: make magical 39 to be dateFieldIndex in future refinement
            
            while (x<43   &&  inputFile.good() ) {
                lastPos = (int)line.find(',', lastPos+1);
                if (x== (dateFieldIdx)) // date we seek is in the 41st field
                {
                    startOfDatePos = lastPos+2;
                } else if (startOfDatePos != 0)
                {
                    endOfDatePos = (int)line.find(',', startOfDatePos)-1; // ", is end of field, so -1 for " part
                    break;
                }
                x++;
            }
            
            submittedDateTimeStr = line.substr(startOfDatePos, endOfDatePos-startOfDatePos );  ///19);
            
            // truncate the hours, minutes and seconds off of the date
            submittedDateStr = submittedDateTimeStr.substr(0, submittedDateTimeStr.length()-9 );
            while(submittedDateStr.find('-') != std::string::npos )
                submittedDateStr = submittedDateStr.erase( submittedDateStr.find('-'), 1);
            dateIntValue = std::atol(submittedDateStr.c_str());//, std::locale("en_US.utf-8"));
            
            if (dateIntValue <= bottomCutOffDateValue || dateIntValue >= topCutOffDateValue)
                continue; // skip to the next record if this one is too early or too late
            
            if (outputFile.is_open())
                outputFile << line << endl;
            else
                exit(EXIT_FAILURE);
        }
        inputFile.close();
        outputFile.close();
    }
    return 0;
}

back to top

Data Loader redone in R

The c++ application made with the source code above was disadvantageous in that it would only run on Macintosh computers, used by 2 out of 3 group members. So this solution was set aside, in favor of the solution presented below, so everyone could run the same dataloader.

We also saved the final data sets to submit as part of the project.

This is only run once and not evaluated after that since we can read data from the filtered data files directly.

# Create function to write data frame to zipped rds file
# The dataframe is split into smaller files depending on size
writeToDisk <- function(df, path) {
  # get the size of the data frame
  filesz = object.size(df)
  
  # Figure out if it needs to be split, we try to 
  # split into sizes ~ 250MB (before compression)
  numsplits = filesz %/% (150*1024*1024)
  
  # Split into subsets and write to disk in RDS format
  # so that we can preserve attritubes including type
  if (numsplits > 1) {
    # Split the dataframe into "numsplits" subsets
    df_split <- split(df, ntile(df$`_projectid`, numsplits))
    
    cat("Writing", numsplits, "files with prefix", path, "\n")
    
    # Save data to separate rds files
    # Wrap loop inside invisible() since we are not interested in
    # the return values
    invisible(lapply(names(df_split), function(x) {
      
      write_rds(df_split[[x]], paste0(path, x, "of", numsplits, ".rds.gz"),
                compress = "gz")
    }))
  }
  else {
    
    cat("Writing 1 file with prefix", path, "\n")
    write_rds(df, paste0(path, ".rds.gz"), compress = "gz")
  }
}

# Create function that download file of type "kind", removes special 
# characters and loads the data
retrieveData <- function(kind, needs_cleanup) {
  
  # Create the download link
  url <- paste0("https://s3.amazonaws.com/open_data/csv/opendata_",
                  kind, ".zip")
     
  # Create the path to download the file to           
  zipname <- paste0("data/opendata_", kind, ".zip")
  
  # Create the filename
  filename <- paste0("opendata_", kind, ".csv")
  
  cat("Downloading from", url, "...")
  
  # Download the file
  download.file(url, zipname)
  
  # Donations, resources and essays data files needed cleanup with
  # special characters, escaped characters etc. creating read errors.
  # Data cleanup was done using sed as a system call after
  # realizing that using pipe() to run sed from R was slow.
  # NOTE: The sed script was created on MacOS and might not be portable.
  # Tried to run sed inside pipe - scan(pipe(sed_cmd), sep = ",") 
  # but had too many issues with needing to use multiple escaped characters
  # Also tried readlines() followed by gsub() but the performance was poor.
  if (needs_cleanup) {
    # cleanup is needed so unzip, run sed and then read in data
    
    # unzip the file
    unzip(zipname, filename)
    
    # Create a sed command to clean out special characters
    sed_cmd <- paste0("sed -i '' -f ", kind,
                      "_clnup.sed ", filename)
    
    cat("Running data cleanup for", filename, "...")
    
    # Run the sed command
    system(sed_cmd)
    
    cat("Loading", kind, "...")
    
    # Read in the data
    assign(kind, read_csv(filename), envir=globalenv())
    
    # Remove files
    unlink(zipname)
    unlink(filename)
  }
  else {
    cat("Loading", kind, "...")
    
    # cleanup is not needed, so read in data directly
    assign(kind, read_csv(unz(zipname, filename)), envir=globalenv())

    # Remove zip file
    unlink(zipname)
  }
}

# Create the list the type of data files we want to download
types_list = c("projects", "resources", "donations", "essays")
  
# Note which files need cleanup
needs_cleanup = c(FALSE, TRUE, TRUE, TRUE)

# Download files, remove special characters and load data
for (index in seq(1:4)) {
  retrieveData(types_list[index], needs_cleanup[index])
}

# Convert dates to "Date" format
projects <- projects %>%
  mutate(date_posted = as_date(date_posted),
         date_completed = as_date(date_completed),
         date_thank_you_packet_mailed =
           as_date(date_thank_you_packet_mailed),
         date_expiration = as_date(date_expiration))

donations <- donations %>%
  mutate(donation_timestamp = as_date(donation_timestamp))

# Filter out projects that were posted in 2014
projects <- projects %>% filter(year(date_posted) == 2014)

# Select resources, donations and essays associated with
# 
resources <- resources %>%
  semi_join(projects, by = "_projectid")
donations <- donations %>%
  semi_join(projects, by = "_projectid")
essays <- essays %>%
  semi_join(projects, by = "_projectid")

# Save filtered data to disk
writeToDisk(df=projects, path="data/opendata_2014_projects")
writeToDisk(df=resources, path="data/opendata_2014_resources")
writeToDisk(df=donations, path="data/opendata_2014_donations")
writeToDisk(df=essays, path="data/opendata_2014_essays")

# Let us clean all the variables so as to be able to start 
# with a clean slate
rm(projects, resources, donations, essays, 
   types_list, needs_cleanup, retrieveData, writeToDisk)

# Cleanup memory
gc()

back to top

Step 2: Data Upload from disk

This is where we would start after the initial data retrieval. We need to load the data from the RDS files in the data folder

# Create function to load data from the rds files containing the
# name "kind".
uploadData <- function(kind) {
  temp <- list.files(path = "./data", 
                    pattern = paste0(".*", kind, ".*rds.gz"),
                    full.names = TRUE)
  # Read in the data
  tables <- lapply(temp, read_rds)
  
  # Combine multiple (or single) dataframes into one and return
  return (bind_rows(tables))
}

# Read in the different data sets
projects <- uploadData("projects")
resources <- uploadData("resources")
donations <- uploadData("donations")
essays <- uploadData("essays")

back to top

Exploratory Analysis:

We began an exploration of the data to see what relationships might be discovered within it, to compare the completed and expired projects.

## [1] "Data is nowhere near normal, looks like logistic analysis of funded vs non-funded make much more sense."
## [1] "Will not look at different factors contained in the projects file to determine if they affect the likliehood of getting funded. There are techinically three outcomes for each request. Complete, means reached or succeeded funding goal. Expired, time ran out wihtout reaching goal. Reallocated, Did not reach goal, but donors chose to give previously pledge amount to a different proposal."

## Warning: Removed 6 rows containing non-finite values (stat_smooth).

back to top

Text Analysis

Word Selection Analysis
# Tokenize the essay and remove stop words and include only
# all alphabetic words. All words are lower case, so there is
# no need to transform
essays_tokenized <- essays %>%
  select(`_projectid`, `_teacherid`, essay) %>%
  unnest_tokens(essay_words,essay)  %>%
  filter(!essay_words %in% stop_words$word &
           grepl("^[[:alpha:]]*$", essay_words))

# Get the sentiment lexicon from mrc
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)

# Assign sentiments to words
essays_sentiments <- essays_tokenized %>%
  left_join(nrc, by = c("essay_words" = "word"))

# Include the funding_status of the projects
essays_sentiments <- essays_sentiments %>%
  left_join(projects, by = "_projectid") %>%
  select(`_projectid`, funding_status, essay_words, sentiment)

# Count the sentiment frequency
essays_sentiment_freq <-
  essays_sentiments %>%
  group_by(funding_status, sentiment) %>%
  summarise(sentiment_freq = n()) %>%
  group_by(funding_status) %>%
  mutate(occurance_pct = sentiment_freq*100/sum(sentiment_freq)) %>%
  ungroup()


# Plot the sentiment frequency and seperate by funding status
essays_sentiment_freq %>%
  filter(funding_status != "reallocated") %>%
  ggplot(aes(x=sentiment, y = occurance_pct, fill = sentiment)) +
  geom_bar(stat="identity") +
  facet_grid(~funding_status) +
  theme(text = element_text(size = 10),
        title = element_text(size = 12),
        legend.key.size = unit(0.5, "cm"),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle('Sentiment Occurance% in Funding Essays') +
  coord_flip()

# Compute word frequency in essays regardless of sentiment
# or funding_status
essays_word_freq <-
  essays_sentiments %>%
  group_by(essay_words) %>%
  summarise(completed_freq = sum(funding_status == "completed"),
            expired_freq = sum(funding_status == "expired"))

# Plot the top 10 words for both funding status
p1 <- essays_word_freq %>%
  top_n(n=10, wt=completed_freq) %>%
  ggplot(aes(x=reorder(essay_words, completed_freq),
             y = log10(completed_freq))) +
  geom_bar(stat="identity", fill = "blue") +
  ggtitle('Top 10 Words (completed)') +
  theme(axis.text = element_text(size = 8),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 10)) +
  xlab("essay_words") +
  coord_flip()

p2 <- essays_word_freq %>%
  top_n(n=10, wt=expired_freq) %>%
  ggplot(aes(x=reorder(essay_words, expired_freq), 
             y = log10(expired_freq))) +
  geom_bar(stat="identity",  fill = "green") +
  ggtitle('Top 10 Words (expired)') +
  theme(axis.text = element_text(size = 8),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 10)) +
  xlab("essay_words") +
  coord_flip()

grid.arrange(p1, p2, nrow=1)

# Students stands out for both "completed" and "expired" projects,
# So create word clouds without it, to see the rest better
essays_word_freq <- essays_word_freq %>%
  filter(essay_words != "students")

# Create word cloud for funded essays
wordcloud(essays_word_freq$essay_words, essays_word_freq$completed_freq,
          min.freq = 10000, max.words=100, random.order=TRUE,
          rot.per=0.35, colors=brewer.pal(8, "Dark2"))

# Create word cloud for expired essays
wordcloud(essays_word_freq$essay_words, essays_word_freq$expired_freq,
          min.freq = 10000, max.words=100, random.order=TRUE,
          rot.per=0.35, colors=brewer.pal(8, "Dark2"))

back to top

Essay Length Analysis: Comparing the word count for completed vs expired projects

# getting some mutually exclusive sets of winning vs losing words
# total word counts of essays by project
essay_total_word_sums <- essays %>%
  select(`_projectid`, essay) %>%
  unnest_tokens(essay_words,essay) %>%  
  group_by(`_projectid`)  %>%
  summarize(word_count= n())  

# just the top records
head( essay_total_word_sums  %>% arrange( desc(  word_count) ), 10)
## Source: local data frame [10 x 2]
## 
##                          _projectid word_count
##                               (chr)      (int)
## 1  d2514b123a2ec329f377dc8b27462b58       1672
## 2  4bd1171a9bdf3553ef873c577661ab76       1309
## 3  2024f2b4273df06b9addcdf15b05ea1c       1069
## 4  0177881c1cf80e786b50a5e59c1356a9       1026
## 5  cee38587c0e3c5f465ded507cf1d2bdc        868
## 6  7b6ffe8110f9a5dc578a4a8d2a836cf8        834
## 7  846ba2cd500c664800f7282a811f0cca        817
## 8  a45494d5a83a74e23db5a2d7ef7069f4        798
## 9  9da21a8db7a816d701061ab6e70965a4        784
## 10 a7d3ebf9a78bd0cf1db66a622c6df720        759
# 2014 project count:
count(projects)$n
## [1] 170326
# separate out the completed projects
completed_projects<- projects %>% filter(funding_status=="completed")

# 2014 completed project count:
count(completed_projects)$n
## [1] 118039
# get the word count sums of the completed projects
completed_total_word_sums <- left_join(completed_projects, essay_total_word_sums, by="_projectid")

# average word sum for a completed project in 2014
mean(completed_total_word_sums$word_count)
## [1] 301.6257
# standard deviation of word count for completed projects of 2014
sd(completed_total_word_sums$word_count)
## [1] 84.44644
# separate out the expired projects
expired_projects<- projects %>% filter(funding_status=="expired")

# 2014 expired project count:
count( expired_projects)$n
## [1] 51246
# get the word count sums of the expired projects
expd_total_word_sums <- left_join(expired_projects, essay_total_word_sums,  by="_projectid")

# number with NA word counts:
count( expd_total_word_sums %>% filter( is.na(word_count) ))$n
## [1] 1
# get rid of the ones with NA word counts: 
expd_total_word_sums<- expd_total_word_sums %>% filter(! is.na(word_count) )

# 2014 expired project count having word counts:
count( expd_total_word_sums)$n
## [1] 51245
# average word sum for a completed project in 2014
mean(expd_total_word_sums$word_count)
## [1] 304.6175
# standard deviation of word count for completed projects of 2014
sd(expd_total_word_sums$word_count)
## [1] 85.73385
Essay word count comparison results
## [1] "For the year 2014, from a total of 118039 essays: "
## [1] "170326 essays were from completed projects and 118039 essays were from expired projects. "
## [1] "All essays had a mean length of 302.4543 and a standard deviation of 84.8453"
## [1] "Completed essays had a mean length of 301.6257 and a standard deviation of 84.4464"
## [1] "Expired essays had a mean length of 304.6175 and a standard deviation of 85.7339"
## [1] "On average, completed essays had essay word counts that were 2.9918 shorter than expired ones"

Final Analysis:

What did you learn about the data? How did you answer the questions? How can you justify your answers?